54 ◾ Bioinformatics
Note that the name of the reference genome file or its URL may change in the future. The
above commands create the directory “refgenome”, download the human reference genome
into it, and decompress the file. Once the reference genome has been downloaded,
you can use “samtools faidx” to index it as follows:
samtools faidx GRCh38.p13_ref.fna
To read more about this command, run “samtools faidx -h”.
For the reference sequence to be indexed by the “samtools faidx” command, the file must
be in FASTA format and well-formed: each FASTA sequence contained in the file must
have a unique name or ID in its FASTA defline, and within each sequence all sequence
lines, except possibly the last one, must be of the same length. Indexing a reference
genome with Samtools enables efficient access to arbitrary regions within the FASTA file
of the reference sequence.
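The two well-formedness conditions can be checked before indexing. The following is a minimal Python sketch, not the validation that Samtools itself performs; the function name “check_fasta” is hypothetical, and the sketch assumes that the sequence name is the first whitespace-delimited word of the defline and that blank lines can be ignored.

```python
def check_fasta(lines):
    """Return a list of problems that would make a FASTA file ill-formed.

    `lines` is any iterable of text lines, e.g. an open file handle.
    Hypothetical helper for illustration; blank lines are ignored here.
    """
    problems = []
    seen = set()          # sequence names encountered so far
    name = None           # name of the record currently being read
    line_len = None       # length of the first sequence line of this record
    short_seen = False    # True once a shorter (presumably final) line was seen

    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            continue
        if line.startswith(">"):
            # Assumption: the name is the first word after '>'.
            fields = line[1:].split()
            name = fields[0] if fields else ""
            if name in seen:
                problems.append(f"duplicate sequence name: {name}")
            seen.add(name)
            line_len, short_seen = None, False
        elif name is None:
            problems.append("sequence data before first defline")
        else:
            if line_len is None:
                line_len = len(line)
            elif short_seen or len(line) > line_len:
                # Only the last line of a record may be shorter than the rest.
                problems.append(f"inconsistent line length in {name}")
            if len(line) < line_len:
                short_seen = True
    return problems
```

For a real file, the function could be called as “check_fasta(open("GRCh38.p13_ref.fna"))”; an empty list means that no problem was found by this simplified check.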
The above “samtools faidx” command creates the index file “GRCh38.p13_ref.fna.fai”,
which has the same name as the reference genome file but with “.fai” appended.
For a FASTA file, the fai index file is a text file consisting of one line per
reference sequence, each with five TAB-delimited columns: NAME (name of this reference
sequence), LENGTH (length of the sequence in bases), OFFSET (byte offset of the
sequence’s first base within the FASTA file), LINEBASES (the number of bases on each
line), and LINEWIDTH (the number of bytes in each line, including the newline),
as shown in Figure 2.4.
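These five columns are enough to locate any base in the FASTA file without reading the file from the beginning. The following Python sketch illustrates the arithmetic under the column definitions above; it is not the Samtools implementation, and the names “FaiRecord”, “parse_fai_line”, “byte_position”, and “fetch” as well as the small demo sequence “seq1” are hypothetical.

```python
import os
import tempfile
from typing import NamedTuple


class FaiRecord(NamedTuple):
    name: str        # NAME
    length: int      # LENGTH, in bases
    offset: int      # OFFSET, byte offset of the first base
    linebases: int   # LINEBASES, bases per line
    linewidth: int   # LINEWIDTH, bytes per line including the newline


def parse_fai_line(line):
    """Parse one TAB-delimited line of a FASTA .fai index."""
    name, length, offset, linebases, linewidth = line.rstrip("\n").split("\t")[:5]
    return FaiRecord(name, int(length), int(offset), int(linebases), int(linewidth))


def byte_position(rec, pos0):
    """Return the byte offset in the FASTA file of the 0-based base pos0."""
    if not 0 <= pos0 < rec.length:
        raise IndexError("position outside the sequence")
    full_lines = pos0 // rec.linebases   # complete lines before this base
    return rec.offset + full_lines * rec.linewidth + pos0 % rec.linebases


def fetch(fasta_path, rec, start, end):
    """Return bases start..end (1-based, inclusive) using a single seek."""
    if start > end:
        raise ValueError("start must not exceed end")
    first = byte_position(rec, start - 1)
    last = byte_position(rec, end - 1)
    with open(fasta_path, "rb") as fh:
        fh.seek(first)
        raw = fh.read(last - first + 1)
    # Drop the line terminators that fall inside the requested span.
    return raw.replace(b"\n", b"").replace(b"\r", b"").decode("ascii")


if __name__ == "__main__":
    # Tiny demo file: an 11-byte defline, then 8 bases per line (9 bytes with '\n').
    with tempfile.NamedTemporaryFile("wb", suffix=".fa", delete=False) as tmp:
        tmp.write(b">seq1 demo\nACGTACGT\nTTGGCCAA\nGG\n")
    rec = parse_fai_line("seq1\t18\t11\t8\t9\n")
    print(fetch(tmp.name, rec, 7, 10))   # prints GTTT
    os.unlink(tmp.name)
```

Because the position of a base is computed directly from OFFSET, LINEBASES, and LINEWIDTH, the requirement that all sequence lines have the same length follows naturally: with lines of varying length, this calculation would point to the wrong byte.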
Remember to index the reference genome with “samtools faidx” as above before you use
it in the steps that follow, and keep the FASTA file and the index file in the same
directory. Note that aligners generally also build their own index of the reference
genome with their own indexing commands. In some reference genomes, the sequences are
named by chromosome (e.g., chr1) instead of by accession number.
In the following, we will discuss the commonly used algorithms for read alignments
and the popular aligners.
FIGURE 2.4 Part of the fai index file of the human reference genome.